90 research outputs found

Identifying statistical dependence in genomic sequences via mutual information estimates

Author: Aktulga HM
Grama AY
Kontoyiannis I
Lyznik LA
Szpankowski L
Szpankowski W
Publication date: 01/01/2007

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. The first is the identification of correlations between different parts of the maize zmSRp32 gene, where we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.

Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
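As a rough illustration of the quantity this abstract builds on, the sketch below computes a plug-in (empirical-frequency) estimate of the mutual information between two aligned symbol sequences. It is not the authors' methodology: the function name `mutual_information` is hypothetical, the pairing of positions is an assumption, and the paper's threshold function for significance is not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired symbols xs[i], ys[i].

    Assumption: the two sequences are aligned position by position.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


if __name__ == "__main__":
    print(mutual_information("ACGT" * 5, "ACGT" * 5))  # identical sequences
    print(mutual_information("AACC", "ACAC"))          # empirically independent
```

Identical sequences over four equally frequent symbols give the full symbol entropy, while empirically independent sequences give zero. On short segments the plug-in estimate is biased upward, which is one reason a significance threshold is needed in practice.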

arXiv.org e-Print Archive

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CUED - Cambridge University Engineering Department

Structural Complexity of Random Binary Trees

Author: E-h. Yang
J. C. Kieffer
W. Szpankowski
Publication date: 31/03/2010

For each positive integer n, let Tn be a random rooted binary tree having finitely many vertices and exactly n leaves. We can view H(Tn), the entropy of Tn, as a measure of the structural complexity of tree Tn in the sense that approximately H(Tn) bits suffice to construct Tn. We are interested in determining conditions on the sequence (Tn: n = 1, 2, ...) under which H(Tn)/n converges to a limit as n → ∞. We exhibit some of our progress on the way to the solution of this problem.

CiteSeerX

Crossref

Data driven consistency (working title)

Author: Anantharam V.
Kavcic A.
Santhanam N.
Szpankowski W.
Publication date: 19/03/2021

We are motivated by applications that need rich model classes to represent them. Examples of rich model classes include distributions over large, countably infinite supports, slow mixing Markov processes, etc. But such rich classes may be too complex to admit estimators that converge to the truth with convergence rates that can be uniformly bounded over the entire model class as the sample size increases (uniform consistency). However, these rich classes may still allow for estimators with pointwise guarantees whose performance can be bounded in a model-dependent way. The pointwise angle of course has the drawback that the estimator performance is a function of the very unknown model that is being estimated, and is therefore unknown. Therefore, even if the estimator is consistent, how well it is doing may not be clear no matter what the sample size is. Departing from the dichotomy of uniform and pointwise consistency, a new analysis framework is explored by characterizing rich model classes that may only admit pointwise guarantees, yet all the information about the model needed to gauge estimator accuracy can be inferred from the sample at hand. To retain focus, we analyze the universal compression problem in this data-driven pointwise consistency framework.

Comment: Working paper. Please email authors for the current version

arXiv.org e-Print Archive

Exploiting a Computation Reuse Cache to Reduce Energy in Network Processors

Author: P. Gupta
W. Doeringer
W. Szpankowski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

Crossref

Leadership Statistics in Random Structures

Author: Ben-Naim E.
Bollobás B.
Bouchaud J.-P.
E Ben-Naim
Ellis R. E.
Flory P. J.
Galambos J.
Janson S.
Knuth D. E.
Mahmoud H. M.
P. L Krapivsky
Smoluchowski M. V.
Szpankowski W.
Publication venue: 'IOP Publishing'
Publication date: 30/07/2003

The largest component ("the leader") in evolving random structures often exhibits universal statistical properties. This phenomenon is demonstrated analytically for two ubiquitous structures: random trees and random graphs. In both cases, lead changes are rare, as the average number of lead changes increases quadratically with the logarithm of the system size. As a function of time, the number of lead changes is self-similar. Additionally, the probability that no lead change ever occurs decays exponentially with the average number of lead changes.

Comment: 5 pages, 3 figures

arXiv.org e-Print Archive

Crossref

EDP Sciences OAI-PMH repository (1.2.0)

Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

Author: D Okanohara
J Fischer
J Kärkkäinen
J Sirén
JI Munro
JS Vitter
K Sadakane
P Ferragina
R Dementiev
T Beller
T Kasai
U Manber
W Hon
W Szpankowski
Publication date: 01/01/2016

The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length $n$ can be represented as an array of length $n$ words, or, in the presence of the SA, as a bit vector of $2n$ bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of $O(n)$ words (i.e. $O(n \log n)$ bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the $2n$ bit version of the LCP array which uses $O(n \log \sigma)$ bits of additional space in external memory when given a (compressed) BWT with alphabet size $\sigma$ and a sampled inverse suffix array at sampling rate $O(\log n)$. This is often a significant space gain in practice where $\sigma$ is usually much smaller than $n$ or even constant. We also consider the case of computing succinct LCP arrays for circular strings.
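For readers unfamiliar with the objects named in this abstract, here is a small in-memory sketch of the suffix array, the LCP array, and the permuted LCP (PLCP) array. It uses the well-known linear-time construction of Kasai et al., not the external memory algorithm of the paper. The naive suffix sorting and all function names are illustrative assumptions.

```python
def suffix_array(s):
    """Naive suffix array by sorting suffixes; only sensible for short strings."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai-style LCP construction.

    lcp[r] is the longest common prefix length of the suffixes starting at
    sa[r - 1] and sa[r]; lcp[0] is defined as 0.
    """
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0  # current match length, carried over between text positions
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 symbols
        else:
            h = 0
    return lcp


def plcp_array(sa, lcp):
    """Permuted LCP: the same values, indexed by text position, not by rank."""
    plcp = [0] * len(sa)
    for r, i in enumerate(sa):
        plcp[i] = lcp[r]
    return plcp


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    print(sa, lcp, plcp_array(sa, lcp))
```

In text order the PLCP values can drop by at most one from one position to the next, so PLCP[i] + i is non-decreasing. That monotonicity is what allows the values to be stored in unary in a bit vector of about 2n bits.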

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Scaled penalization of Brownian motion with drift and the Brownian ascent

Author: B Roynette
C Profeta
I Karatzas
IV Denisov
J Bertoin
J Obłój
J Pitman
JP Imhof
K Yano
L Chaumont
LCG Rogers
M Rosenbaum
M Yor
P Debs
RT Durrett
T Povel
U Schmock
W Szpankowski
Y Yano
Publication date: 29/05/2018

We study a scaled version of a two-parameter Brownian penalization model introduced by Roynette-Vallois-Yor in arXiv:math/0511102. The original model penalizes Brownian motion with drift $h\in\mathbb{R}$ by the weight process $\big(\exp(\nu S_t):t\geq 0\big)$, where $\nu\in\mathbb{R}$ and $\big(S_t:t\geq 0\big)$ is the running maximum of the Brownian motion. It was shown there that the resulting penalized process exhibits three distinct phases corresponding to different regions of the $(\nu,h)$-plane. In this paper, we investigate the effect of penalizing the Brownian motion concurrently with scaling and identify the limit process. This extends a result of Roynette-Yor for the $\nu<0,~h=0$ case to the whole parameter plane and reveals two additional "critical" phases occurring at the boundaries between the parameter regions. One of these novel phases is Brownian motion conditioned to end at its maximum, a process we call the Brownian ascent. We then relate the Brownian ascent to some well-known Brownian path fragments and to a random scaling transformation of Brownian motion recently studied by Rosenbaum-Yor.

Comment: 32 pages; made additions to Section

arXiv.org e-Print Archive

Crossref

DigitalCommons@UConn

OpenCommons at University of Connecticut

Stability Analysis of Frame Slotted Aloha Protocol

Author: AG Pakes
BS Tsybakov
C Bordenave
FC Schoute
H Inaltekin
H Okada
H Wu
H-J Noh
Harald Vogt
J Jeon
J Sant
JE Wieselthier
JF Mertens
L Barletta
LG Robert
M Kaplan
M Mitzenmacher
N Johnson
RR Rao
S Ghez
SC Kompalli
V Naware
W Feller
W Szpankowski
X Liu
Y Zhu
ZG Prodanoff
Publication date: 17/09/2014

The Frame Slotted Aloha (FSA) protocol has been widely applied in Radio Frequency Identification (RFID) systems as the de facto standard in tag identification. However, very limited work has been done on the stability of FSA despite its fundamental importance both to the theoretical characterisation of FSA performance and to its effective operation in practical systems. In order to bridge this gap, we devote this paper to investigating the stability properties of FSA by focusing on two physical layer models of practical importance, the models with single packet reception and multipacket reception capabilities. Technically, we model the FSA system backlog as a Markov chain whose states are the backlog sizes at the beginning of each frame. The objective is to analyze the ergodicity of the Markov chain and demonstrate its properties in different regions, particularly the instability region. By employing drift analysis, we obtain the closed-form conditions for the stability of FSA and show that the stability region is maximised when the frame length equals the backlog size in the single packet reception model, and when the ratio of the backlog size to frame length equals, in order of magnitude, the maximum multipacket reception capacity in the multipacket reception model. Furthermore, to characterise system behavior in the instability region, we mathematically demonstrate the existence of transience of the backlog Markov chain.

Comment: 14 pages, submitted to IEEE Transactions on Information Theory
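The role of the frame length in the single packet reception model can be illustrated with a standard textbook calculation, sketched below. If each of n backlogged tags independently picks one of L slots uniformly at random, a slot is successful only when exactly one tag picks it, so the expected number of successes per frame is n(1 - 1/L)^(n-1). This is not the paper's drift analysis: the function names are hypothetical, and the search range of up to 4n slots is an arbitrary assumption.

```python
def expected_successes(n_tags, frame_len):
    """Expected number of slots chosen by exactly one tag in a single frame.

    Assumes single packet reception and uniform, independent slot choices.
    """
    if n_tags < 1 or frame_len < 1:
        raise ValueError("n_tags and frame_len must be positive")
    return n_tags * (1.0 - 1.0 / frame_len) ** (n_tags - 1)


def slot_efficiency(n_tags, frame_len):
    """Expected successes per slot, a simple per-frame throughput measure."""
    return expected_successes(n_tags, frame_len) / frame_len


def best_frame_length(n_tags, max_factor=4):
    """Frame length in 1..max_factor * n_tags with the highest slot efficiency."""
    candidates = range(1, max_factor * n_tags + 1)
    return max(candidates, key=lambda length: slot_efficiency(n_tags, length))


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        print(n, best_frame_length(n), round(slot_efficiency(n, n), 4))
```

Under these assumptions the per-slot efficiency peaks when the frame length equals the number of contending tags, which is at least consistent with the abstract's condition for the single packet reception model. The efficiency at that point approaches 1/e as the backlog grows.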

arXiv.org e-Print Archive

The Number of Symbol Comparisons in QuickSort and QuickSelect

Author: B. Vallée
D. Dolgopyat
D.E. Knuth
J. Clément
N.E. Nörlund
P. Flajolet
P. Grabner
P. Kirschenhofer
R. Sedgewick
V. Baladi
W. Szpankowski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

We revisit the classical QuickSort and QuickSelect algorithms, under a complexity model that fully takes into account the elementary comparisons between symbols composing the records to be processed. Our probabilistic models belong to a broad category of information sources that encompasses memoryless (i.e., independent-symbols) and Markov sources, as well as many unbounded-correlation sources. We establish that, under our conditions, the average-case complexity of QuickSort is O(n log^2 n) [rather than O(n log n), classically], whereas that of QuickSelect remains O(n). Explicit expressions for the implied constants are provided by our combinatorial–analytic methods.
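The cost model described in this abstract can be made concrete with a small sketch that sorts strings while counting both whole-key comparisons and the individual symbol comparisons they require. It only illustrates the accounting, not the paper's probabilistic analysis. The first-element pivot, the function names, and the convention of charging one unit per pair of symbols examined are all assumptions.

```python
def compare(a, b):
    """Compare strings left to right; return (sign, symbol pairs examined)."""
    cost = 0
    for x, y in zip(a, b):
        cost += 1
        if x != y:
            return (-1 if x < y else 1), cost
    # One string is a prefix of the other (or they are equal): decide by length.
    return (len(a) > len(b)) - (len(a) < len(b)), cost


def quicksort(keys):
    """Return (sorted keys, key comparisons, symbol comparisons)."""
    if len(keys) <= 1:
        return list(keys), 0, 0
    pivot, rest = keys[0], keys[1:]  # first-element pivot, for simplicity
    smaller, larger = [], []
    key_cmp = sym_cmp = 0
    for k in rest:
        sign, cost = compare(k, pivot)
        key_cmp += 1
        sym_cmp += cost
        (smaller if sign < 0 else larger).append(k)
    left, left_keys, left_syms = quicksort(smaller)
    right, right_keys, right_syms = quicksort(larger)
    return (
        left + [pivot] + right,
        key_cmp + left_keys + right_keys,
        sym_cmp + left_syms + right_syms,
    )


if __name__ == "__main__":
    print(quicksort(["abc", "abd", "aab", "b"]))
```

Keys that share long prefixes cost more symbol comparisons per key comparison, which is the effect behind the extra logarithmic factor stated in the abstract.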

HAL - Normandie Université

Crossref

INRIA a CCSD electronic archive server

On the asymptotic joint distribution of height and width in random trees

Author: Aldous D.
Biane P.
Chassaing P.
Chung K. L.
Devroye L.
Donati-Martin C.
Drmota M.
Flajolet P.
Graham R. L.
Janson S.
Jeulin T.
Kennedy D. P.
Louchard G.
Meir A.
Pitman J.
Rényi A.
Svante Janson
Szpankowski W.
Takács L.
Yor M.
Publication venue: 'Akademiai Kiado Zrt.'

Crossref